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FAST FUNCTIONAL INTEGRALS WITH APPLICATION TO 
DIFFERENTIAL EQUATIONS MODELS 

JOHN TILLINGHAST 


1. Abstract 

A new method is introduced which uses higher-order Laplace approximation to evaluate 
functional integrals much faster than existing methods. An implementation in MATLAB 
is called SLAM-FIT (Sparse Laplace Approximation Method for Functional Integration on 
Time) or simply SLAM. In this paper SLAM is applied to estimate parameters of mixed 
models that require functional integration. It is compared with two more general packages 
which can be used to do functional integration. One is Stan, a recent and very general 
package for integrating and estimating using hybrid Monte Carlo. The other is INLA, a 
recent R package which uses Laplace approximations for Gaussian Markov random fields. 
In both cases it is able to get near-identical or equivalent results in less time for moderately 
sized data sets. The fundamental speed advantage of the algorithm may be greater than 
it appears, because SLAM is running in pure MATLAB while the other two packages use 
optimized compiled code. 

Keywords: compartment models, predator-prey, SIR model, path integral, functional 
integral, Laplace approximation, saddlepoint method. 


2. Introduction 


This paper has two purposes: to introduce a new, much faster computational method 
for functional integration, and to show its usefulness in estimating parameters of differen¬ 
tial equation models. For a time-dependent random process, a functional integral can be 
thought of as the integral over all possible values at all times over the relevant time inter¬ 
val 


Dirac, 1933, Feynman and Hibbs 


2010, 

Schulman 

2005 

Kleinert 


they are integrals over a function space such as a space of Brownian motion paths. They 
have been used for at least eighty years in many branches of science and applied mathemat¬ 
ics including quantum and statistical mechanics, polymer science, probability, and finance. 
Typically they are used to compute the chance of a system evolving from one state to a 
different one: any in-between path is possible, but the probability is given by the integral 
over the possible paths. This makes them a natural tool to use for analyzing systems 
which are imperfectly modeled by ordinary differential equations. In life science there are 


predator-prey models Edelstein-Keshet 

1988 , infectious disease models Kermack and 

McKendrick 

1927, 

Anderson and May 

1991 , pharmacokinetics models |Gelman et al. 

1996 , and many others. Most of macroeconomics depends on differential equation-based 
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models such as the Solow growth model or the ISLM model Romer, 2011| . These models 
can be made more realistic by considering noise or stochastic behavior. 

A major goal of differential equations models is to estimate system parameters, such 
as how infectious a virus is, or reaction rates for chemicals and enzymes. Traditional ap¬ 
proaches involve simplified models on transformed variables Lineweaver and Burk 1934| or 
optimizing some measure of data fit over exact solutions to the ODEs [Anderson and May 


1991 . Recent methods use a generalized smoothing approach. For a finite-dimensional 


basis, such as splines, it is possible to approximately fit both the differential equations and 
the data Ramsay J. et al., 2007, Campbell and Steele, 2012 . As with smoothing, the trade¬ 


off between data fit and ODE fit can be chosen by cross-validation or by a mixed-model 
approach. 

In this paper, we use a very different approach: we discretize time, and model the 
underlying true values as a first-order Markov process. Then we assume that the data were 
generated with a distribution depending on the true value. Not all of the time points have 
data; in fact, it may be important to add in-between time points, just as they would be 
needed for a finite difference method. Then we define a marginalized pseudolikelihood of 
the system parameters as an integral over the possible values which the variables might 
have taken at the time points. This approximates a functional integral over a continuous 
random process. 

There are both analytic and computational methods for evaluating functional integrals. 
Analytic methods require a tractable integrand or a good approximation to one. Very 
clever schemes have been painstakingly developed to transform functional integrals and 
make them tractable [Kleinert 2009, Schulman, 2005 . Far more problems have so far 
required time-consuming Monte Carlo methods. 

Laplace and saddlepoint approximations have also been used to approximate analytic 
functional integrals when possible. But this appears to be the first time that the approxi¬ 
mations have been used for numerical evaluation. One reason for this may be that some of 
the higher-order terms, involving third- and fourth-order tensors, are difficult to evaluate 
efficiently. 

This paper introduces a way to compute the higher-order terms quickly for numerical 
functional integrals. This requires computing a critical path (a unique most likely realiza¬ 
tion) and expanding the log-likelihood around it. The log-likelihood is a sum of interactions 
between neighboring time points. Consequently the relevant Hessian and the tensors of 
third- and fourth-order derivatives are sparse with a simple block structure: all entries 
corresponding to non-neighboring time points are zero. By using this sparsity structure, 
it is possible to calculate the higher-order terms in 0{np^) time, where n is the number of 
time points and p is the number of variables in the system. 

A variance-stabilizing transformation is needed when the data times are separated by 
many in-between time steps. In general it makes the integrals more accurate than when 
untransformed. SLAM can be used with the basic Laplace term alone, or with the higher- 
order terms. Higher-order terms make the integrals more accurate, but for the examples 
tested, the parameter estimates are close to those found using the basic Laplace approxi¬ 
mation. 
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To evaluate speed and accuracy, SLAM was compared with two popular systems for 
mixed models. Results for an infectious disease data set are equivalent to results from 
Stan [STAN Development Team , the leading general-purpose Monte Carlo system, but 
SLAM is much faster. There is also a well-implemented Laplace-based method, INLA [Rue 


et al., 2009, Rue et al. , designed for hierarchical models and Gaussian Markov random 


fields. 

For random fields, INLA takes advantage of sparsity, but the higher-order terms still 
take quadratic time. The special approximation introduced here can only be applied to 
GMRFs on a line, not for other type covered by INLA. To compare INLA and SLAM, 
we used an INLA demonstration data set (Tokyo rainfall by day of year). For repeats of 
the Tokyo data, estimates are well within the Cl for the two methods. INLA (with full 
higher-order terms) is faster at 1100 data points but SLAM is faster beyond that. INLA 
using simplified higher-order terms remains somewhat faster up to the largest set tested 
(about 15000 data points). 


3. METHODS 

3.1. Markov process setup. The model has two stages: a Markov process for the true 
values, and on top of that, measurements with noise. We assume that the Markov and 
measurement likelihoods depend on parameters 6. We show how to define a marginalized 
likelihood for 6. Maximizing the marginalized likelihood gives an estimate of 6. 

In this paper we will use Latin indexes (“i”) to denote time points and matrix blocks. 
We use Greek indexes (“a”) to denote individual vector and matrix entries. Each time 
point has as one entry for every variable. This becomes important when analyzing the 
blocking structure. We also use the summation convention for duplicated indexes (e.g. 
means ^«/ 3 ^q;/ 37 ') 

The Markov process has a vector value at each time point, denoted by y* at step i (time 
ti). (We use y without an index to mean the entire realization, i.e., the concatenation 
of the Yi.) The system also has parameters 6 that involved in the Markov process or 
the measurement error. From the Markov process, at time tj+i there is a conditional 
probability density for y^+i, given the value of y*: 

ftrans (yi+l|yi, 0 ) = exp{ - ^trans (yi+i|yu^)) 

so given the initial values yi, we have the overall conditional density for a realization y at 
all time points: 


n—l 

fdyn (y|yi) ft vans (yi+i|yu^) 

i=l 

If we have a prior 7r(yi) we can define the probability of a realization: 


fdyn{y\0) = Triyi)fdyn (y|yi, 0) 
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Some of the time points have data. Let Idata be the set of their indexes. Call the data 
values y* for i € Idata- We assume that the data are independently generated with a 
probability density depending on the true value and parameters: 


fdata (y*|yi, 0) = exp{- Idata (y*|yi, 0) ) 

Depending on the system, the error could be measurement error from an instrument, or 
from sampling error, or from some other source. The overall density of the given realization 
and data is 


n—1 

f (y ? y I '^(yi) fdyn (y^+i lyi ; n fdataiyi ly^; 

i^Idata 

This means that the pdf of y* given 6 is given by 


~ n—1 

f{y*\^)= T^{yi)T\fdyn{yi+i\yi,0) IT fdata{y*i\yi,0)d^y 

( n-1 
log 7r(yi) - ^ idata {yi\yi,d) - ^ itrans {yi+i\yi,6) 

^^^data ^ 1 

Given 7r(yi) it is possible to define a marginal likelihood for 6 by integrating f {y*\0) 
over the latent true values. But in general, we don’t know '7r(yi), and we use the improper 
prior 7r(yi) = 1. The integral is finite because of the multiplication by fdata which can be 
thought of as a prior on the data points. This gives us the marginalized pseudolikelihood 


where 


M(0) 


n—1 


Ylfdyn{yi+l\yi,0) fdata{yi\yi,0)d^y 

^=1 '^^^data 




Af-l 

dy)= E idata{yi |y*) T ^ ^ itrans (yi+l|yi) • 

This marginalized pseudolikelihood is what we will maximize to estimate 6. 

3.2. Laplace approximation (basic and higher-order). In this subsection, we look 
closely at how to get a good approximation for M. In addition, assume that i{y) has a 
single peak at y. Then we can expand around y: 
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y = y+ e 


M = 



e-^(y)d^y 



^iy+^)d^e 


= exp (^-i (y) - ^£^3 (y) (y) " •••) 

= e-^(y) j ^ 


where H is the Hessian of l{y) at y, and T and F are the tensors of third and fourth 
derivatives at y. 

We can think of £ as a Gaussian random variable, with precision matrix H, and this 
integral is the expectation of a function of e: 


M = e ^(y)(27r)^/2|jj| V2^ jg^p - ...^ | 

The expectation can be expanded and then approximated with an asymptotic series. 


f 1/1 1 \' 

M OC F \ ^ ^ ^ \^^^Fq^p^£qi£ p£^ + ~^FQ^p^x^a^ T ■■■J 


. r=0 

= EU- 




1 /I 


+ ^ i ^^a/3jA^a^/3^y^A + ••• ) + ••• 


"^1 ^ \ 21 ^a/37^Q:^/3^7 “1“ ^^Poi^yA^a.^^^y^A ••• 




21 ) Fap^Tx^y£a£p£^£x ^T ••• | 


1 1 / 1 \ ^ 

~ 1 ~^Fap^xF (^£a£p£'y£x') ^ ^ ) F'ap^Tx^uF {£(x£p£'^£x ^T •■• 

This series diverges, but the terms shown can produce a good approximation to M on 
their own. A similar, often better approximation is given by a cumulant expansion for 
log M [Shun and McCullagh 1995, McCullagh, 1987 , where the first higher-order terms 
are the same as above, but the product is turned into a sum: 

















6 


JOHN TILLINGHAST 


N 1 

log M -i (y) + —log {2 tx) - -log\H\ 

1 f iV rr. rr^ TJ^/ 

+ ... 

N 1 

= (y) + Y^'^5' (2^) “ 2 ^og\il\ + IV + Ilia + Illb + ... 

where 


IV = 

Ilia = — ■ 9II^j3 Tafiyll^^ 

Illb = ^ • QT^p,H-lH-plH-^Tx,. 


We use the cumulant expansion, for log M, because the derivatives are simpler and 
because in many cases it is more accurate [Shun and McCullagh 1995 


Terms Ilia and Illb involve the same tensors, but the sums are very different. In term 
Ilia, as long as we have the near-diagonal entries of H~^, we can immediately convert the 


third-order tensors into vectors = H^^Ta/s-y- Then Ilia = v'^H ^v. In sectio n 


we 


explain how to compute the near-diagonal entries in 0{np^) time Asif and Moura, 2005 
getting v'^H^^v is even faster because H is block-tridiagonal. 

Term Illb is different and much more difficult. Each connects one mode of the first 
T to a mode of the second T, and H~^ is full. This means that entries from one time 
point of the first T are multiplied by entries from all other time points in the second T, not 
just entries from the neighboring time points. This would seem to mean that Illb requires 
quadratic time (in n) to compute. The appendix shows how it can be done in linear time 
by using power series and recurrence relations. 

Graphically, the difference can be represented this way: 


Ilia = 


o .o 


Illb = T 


•H 


■H 


■H 


Here each line represents a contraction over two modes: one from each tensor, if there 
are two, or two from a single tensor. This notation will be helpful later when doing more 
complicated manipulations on term Illb. 

It is worth emphasizing that all three of these terms are invariant under linear transfor¬ 
mations. If we make a change of variables y = By, then 
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dy dy 

and the derivatives tensors (at the critical path) become 


Haj3 — 


d‘^i 


dyadyi3 




H = B^^HB ^ 


= T,.pB-^B;^B^^ T = (B-t ® B-t ® B-T)(T) 

Fai3^\ = F^^p^B-^B-^B-^B-l F = (B'"^ ® B^'^ (g) B"'^ 0 B''^)(F) 

Then the sum -^Fai 3 ^xH~^H~l is still equal to term IV, ^H~jfap^H~lfxp,uH~^ is still 
term Ilia, etc. In the simplified matrix and graphical notation, the transformation of T 
and F can be written as 


B 1 B'^ B-i 

/ \ / 

T = B~^ — T F = F 

\ / \ 

B 1 B-t B-^ 


In reality, both tensors are being multiplied symmetrically in all modes. The reason for 
using transposes is because if a matrix is on the left side, its rows are multiplied, and if a 
matrix is on the right side, its columns are multiplied. This is consistent with the usual 
direction of matrix multiplication. Using the same convention will make things simpler in 
the Appendix. 


4. The Sparsity Structure and How to Use It 


Take another look at the log likelihood £{y) (3.1). Each term in idata uses values from 
one time point. Each term in idyn uses values from two adjacent time points. 

This means that if i is any time point, the blocks Hi^j, and can be nonzero. 

Such a matrix is called block- 


d'^e 


= 0 


pxp- 


For any other j, i.e. if |i - jj > 1, Hij = 
tridiagonal (see figure]^. 

Likewise, Tij^k is zero unless max(|i — jj, \j — k\,\i — k\) < 1. And Fij^k,i is zero if any 
of the ( 2 ) = 6 differences is greater than 1. 


4.1. Setup, block-diagonals, block off-diagonals, levels. In this subsection we see 
how to work with H, then apply that to computing IV and Ilia. The details for Illb are 
worked out in the appendix. 

If y is a true local minimum of i, then H has to be positive definite. This is crucial, 
because we need the Cholesky decomposition for the calculation. 

Here D is block-diagonal but not symmetric. In fact, D is lower-triangular. A is block 
off-diagonal, level -1. This means that Ai j can only be nonzero i = j + 1. 
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/ Pi 

Qi^ 

0 

0 

\ 


Qi 

P2 

Q2^ 

0 



0 

Q2 

P 3 

Q3^ 



V 0 

0 

Q 3 

P 

4 

/ 

= LL'^, where 






/ Di 

0 

0 

0 \ 




El 

D2 

0 

0 




0 

E 2 

Da 

0 




V 0 

0 

E 3 D4 y 



= D + E 






= D(I + D 

) 




= Dfl- 

A) 






Figure 1. Decomposition of H. 

If X has level lx = 1, and Y has level ly = —2, then XY has level I = lx + W = ~1- The 
third block of level XY is zero because Y has only 2 nonzero blocks. 


/ 0 Xi 

0 

0 \ 


/ 

0 

0 

0 

0 \ 

0 0 

X2 

0 



0 

0 

0 

0 

0 0 

0 

Xa 


Yi 

0 

0 

0 

\ 0 0 

0 

0 


V 

0 

Y2 

0 

0/ 

/ 0 

0 

Q 

0 \ 





X2Y1 

0 

0 

0 





0 

X3Y2 0 

0 





V 0 

0 

0 

0 







Figure 2. Off-diagonal matrix multiplication 


In general, the np x np matrix X is block off-diagonal, level I, if the blocks Xj j ^ 0 only 
when j = i + 1. 

Block-off-diagonals have a raising and lowering property, demonstrated in Figure If 
X is block off-diagonal with level lx, and Y is block off-diagonal with level ly, then XY is 
block off-diagonal with level lx + Iy- “Raising the level by 1” happens when you multiply 
on either size by a block off-diagonal matrix of level 1. “Lowering by 1” is the same as 
“raising by -1”, which means multiplying by a block off-diagonal of level -I. 

4.2. Computing near-diagonal elements of Moving on with the computation. 
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Tt^T 


H = D(I-A)(I-A) D 
= D~^(I-A)"'^(I- 


The power series for (I — A) ^ actually terminates because A is strictly lower-triangular. 
So we can expand and group terms: 


H 


D 


-T 


Ea'"" D-' 


q>0 


r>0 


q>0r>0 

The terms of the double sum can be grouped by level so as to give us the near-dia gonal 
levels of This results in an algorithm similar to the block-tridiagonal case of Asif 


and Moura, 2005 


Each term in the double sum has exactly one level, q — r. Now we will group the terms 
by level. Break the sum into the two cases, q < r and q > r: 




g>0 


r>Q'>0 

^J.>0,q>0 i^>0,r>0 

AT'iA'?A'^) +Y{Y1 A^^^A^A' 

^>0 

Y SA'" + S + ^ A'^'^S 

/ i >0 i />0 


where 


S = ^A'^‘iA'1. 

<?>o 

Now we will show how to compute S in 0{np^) time by a block matrix recurrence 
relation. 

Obviously, 


S = 1 +A^SA 

The map S' —)• A^SA has a special property: each block of A'^SA only depends on the 
following block of S, and the last block of A'^SA is zero. 
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/ 0 

Ai^ • • • 

0 

0 \ 


/ Si 

0 


0 0 \ 


( ° 

0 

0 


0 

0 Aa^ 

0 

0 


0 

S2 


0 0 


Ai 

0 

0 

0 












A 2 



0 

0 

0 



0 

0 


• Sn_i 0 


0 

0 

0 

0 

V 0 

0 

0 

0 / 


1 0 

0 


• 0 Sn y 


V 0 

0 

• • An_i 

0 / 







/ 

A^SaAi 

0 



0 

0 \ 








• o 

> 

S 3 A 2 •• 


0 

0 








0 

0 



iSnAn—1 

0 







V 

0 

0 



0 

0/ 


Because of this property, we can solve S = I + A'^SA by a reverse iteration: 


Sn = I 

Sk = I + A^Sk+iAk 


for k going from n — 1 down to 1. 

This gives a total of 0{n) block multiplications, each of which takes up to 0{p^) flops 
for total complexity 0(np^). There may be special cases when there are less than 0{p^) 
flops per block, but that would be somewhat unusual because A is generated by a Cholesky 
decomposition. 

Having computed S, we now have 

^SA'^ + S + ^A^'^S 

/^>0 ^>0 

We will never actually compute all the entries of - it is full and would need too 
many flops. For terms IV and Ilia, we only need the block-tridiagonal part, 

D-'^(^SA + S +A'^S)D”^ 

which has 0{np^) entries and takes 0{np^) time to compute. 

4.3. Computing IV and Ilia. Term IV is simple: each nonzero element of F occurs 
exactly once, and is multiplied by block-tridiagonal elements of H~^. F has O(np^) nonzero 
terms, and the number of operations is clearly bounded by 0{np^). It may be less if the 
blocks of F are themselves sparse. 

Term Ilia is slightly more complicated. The first part is comparable to term 

IV: each nonzero element of T occurs once. This takes up to 0{np^) time and returns a 
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vector V with Then term Ilia is proportional to v^H~^vx = v^L~^L“^v = 

||L~^v|p. Solving L~^v takes 0{np^) time since L is triangular and block-banded. 


5. Numerical Examples and Experiments 

5.1. An Example: Poisson-Distributed Growth and the Need for Variance Sta¬ 
bilization. Imagine a bacteria colony with a known rate of division per minute, 6. The 
deterministic equation of growth would be 


dt 


= Oy 


But we know that the process is not really deterministic, and that the bacteria count 
and number of divisions are integers. So a reasonable model is that the number of divisions 
over the short time Ajt is Poisson distributed with mean OyiAit: 


Aiy ~ Poiss{9yiAit) 

E{Aiy) = Var{Aiy) = OyiAit 

But SLAM can’t handle discrete variables directly: the Laplace approximation uses an 
integral over the values of the yi. The obvious fix is to use the normal approximation to 
the Poisson distribution: 


Aiy ~ M{0yiAit, 

(■trans{yi+l\yi) = ^ (^Zofif(27r%Ait) ) 

^(y) = X] ^data{yi\y*i) + \'^ log{2'K9yiAit) + ev^Ad"^'^ 

.V-r, A^\T A^\T ^ 


where is the data-based likelihood of y* given y*. For now we assume that 

it peaks at y*. 

Now we should be able to find the critical path y that minimizes f(y), take derivatives, 
and compute the various Laplace terms. And presumably, for small enough time steps, 
this would give a good approximation to the stochastic differential equation. 

Unfortunately, without a further tweak, even this simple example can fail precisely 
because of using small time steps between the data points. If there are too many consecutive 
time points between the data times, i.e. too many points without data, you can get the 
problem shown in figure Increasing N tends to drive y lower and lower; if there are 
enough points between data points, y can approach zero for some time points. 

This doesn’t just mean that the critical paths get ugly; they are actually getting farther 
and farther from any realistic paths that the system might take. This is possible because 
the probability density of the path is large for y close to zero, even though the total 
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Time (arbitrary) 


Time (arbitrary) 


Figure 3. Poisson bacterial count figure. 

probability measure of that region is still quite small. The problem stops being good for a 
Laplace-type approach. 

Fortunately, there is a way around this problem. We start by identifying the reason for 
it, then explain the solution for this example. 

The explanation below is a heuristic, not a theorem, but it is based on actual observation. 
Rearranging the expression for .^(y), we get 

^(y) = log{2TrAit) ^ idataiVi) + log{yi) + 

A^AT «V- r . A^AT A^AT ^ 


The first term is constant with respect to y and does not affect the minimization of £(y}. 
The next term is a data term; it only depends on yi at the data points, and it actually 
increases when y^ goes below from y*. So to understand the evolution of the critical path 
with increasing N, we have to look to the other two terms. 

The third term (the log{y) term) typically increases without bound as N ^ oo. For 
example, if there is a limiting path y{t), with evenly spaced time points, ~ 

N{log{y)) where {log{y)) is the mean of log y{t). 

But the last term does not increase without bound. In practice. 


1 

2 


E 

i<N 


{Ajy - OyiAjtf 
OyiAit 
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behaves like a Riemann sum approaching its limiting value-an integral. This is because, 
in practice, for A^t small, Ajy — OyiAit ~ O(Ajt). (If all of the critical paths were along 
the same smooth function y{t), then Aiy — OyiAit would be O(Ajt^). This doesn’t happen 
here because the yi come from critical paths for different partitions. Essentially this sum 
behaves more like it would for a Brownian motion than for a smooth function.) 

Using Aiy - OyiAit ~ O(Ajt), we get 

{\y ~ OyiAit)"^ ^ (O(Ajt)^) 

^ OyiAit ^ OyiAit 

AT-l 

= E 

i=\ 

if the yi are uniformly bounded below (uniformly for different N). 

For this reason, as iV —)■ oo, the log-likelihood is dominated by the third term, the sum 
over values of log yi. This is why the critical path keeps being pushed lower and lower in 
order to minimize i. 

This creates a problem: for accuracy with a continuous process, we may need small time 
steps, but with small time steps we get this artifact. It arises because the variance of i/j+i 
depends on y*: a smaller y* gives a smaller variance, which gives a greater likelihood. 

The solution is to make a change of variables in the integral. If we pick the right 
transformation v = u(y), we can stabilize the variance of v and get a realistic critical path 
to expand around. If Vi = v{yi), and Var{yi) = OyiAit is small. 


Var{vi+i) ~ v'{yif ■ Var{yi+i) 
= v'{yif • yiAit 

so we can stabilize the variance of v if 


v'iyfy = 1 
v'{yf = - 


V = 2^ 

So Vi = 2^/yi has stable variance. How does this affect the integral for M (0) ? 
If we let 


_ 1 (Ajy - PyjAitf 
2 ^ OyiAit 


then 
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Ldyniy) Ot (n ® 

and the marginalized likelihood is proportional to 



i<N V y* 


Ldata{y)e • dyiv 

i<Af V 


Ldata{y)e-^^^^^ n d (2V^) 

i<N 


F is always positive, so e~^ < 1. Ldata is maximized when yi = y*. There is an 
additional factor of nw, but realistically it cannot go to infinity as fast as Ldata and e~^ 
go to zero for large vn- 

That means that this integrand, unlike the pre-transformation integrand, is bounded 
and has some reasonable critical path in terms of v. 


5.2. SIR model. Our primary model for testing comes from infectious disease epidemiol¬ 
ogy. One of the simplest widely-used models is the SIR model [Kermack and McKendriA 


1927 . There are three compartments, Susceptibles, Infected, and Recovered. At each 


time step, some Susceptibles become Infected, and some Infected become Recovered. Two 
parameters correspond to infectiousness (/3) and speed of recovery ( 7 ). The traditional 
method of estimating these parameters uses the deterministic equations 


dt 

dJ 

dt 

dR 

dt 


-psi 

fdSI - -fl 

jl 


The assumptions behind this are simple: 

( 1 ) new infections are proportional to contacts between susceptible and infected; 

( 2 ) new recoveries are proportional to the number infected. 

Assumption (2) is memoryless, which is not very realistic, but is often adequate [Ander- 


son and May [TMTl 


To convert this to a stochastic model, we treat new infections at time i {vh) and new 
recoveries at time i [vRi) as independent Poisson variables with means given by the deter¬ 
ministic model: 
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Uli = Euii = PSJiAit 
vii ~ Pois{l3SiIiAit) 
fiRi = EuRi = -fhAit 
VRi ~ Pois{-fIiAit) 


As with the bacterial colony, we use the normal approximation: 


Uli ~ 

URi ~ 

AiS = 




AJ = 


r\j 


l^Ri) 

( M/U /^/*) 

Rli — RRi 

-AiS - VRi 

N{ — AiS — fiRi, fJ,Ri) 


We assume the measurement error is lognormal with a third parameter, a: 


I-'data{^7 1 |*^) 


n 

'^^^data 


27ra^S*I* 


exp 






The dynamical part of the likelihood is 
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n—1 


Ldyn{S,l\P,j) = n 77^ 


\/27r^/j y 2 fij{ 


1 {AiS + 


2 


1 {Ail + AiS + p^RiY 


IJ^Ri 


n—1 ^ 

= (27r)-^/2 n ^_ exp ( - F{S„ h, g.+i, J.+i|/3, 7 , A,t)) 


where 


Jl'fQ T Q T \ 1 (^*5'+ 1 + AjS" + 


2 /i/i 2 


P^Ri 


so 


n—1 


n—1 


Ldyn = (2vr) exp ( - ^ F{Si,Ii, 5*+!, li+i)) TT 
^ i ^ r= 1 


= {2Ti)-^^l^e 


-N/2-F,u{S,1) 


(271) 


-N/2 


VWy 


-Faii{S,I) 


n—1 ^ 

n ^pSiliAit ■ ■yliAit 

n—1 ^ 

n_i_ 


As with the bacteria model 15.1, the critical path often gets artifacts for small time steps 
and the approximation can be poor. So we rewrite the differential as 


n—1 


Ldyn rs dr 


dSi dli 


I<xeM-Faii) TT ' 

PV aiij 11 ^/-A 


t 


* dSTi dlr) 


n—1 


oc 


dSi dli 


expi-Faii) TT 

i=i A 


so we want to pick transformed variables vs and vj such that 


, dS , dl 
dvQ oc dvT oc — 

Vs I 


If we take 


vs = 2VS, VI = log{I) 

then our integral avoids the artifacts explained in 
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5.3. Comparison with Rstan on British Boarding School data. Stan [STAN Devel- 


opment Team is a tool for doing Bayesian statistics using Hybrid Monte Carlo estimation. 


Among other things, it can estimate Bayesian posteriors for parameters in very general 
models, which can be specified with an easy-to-use general model language. 

This makes Stan a natural check for the approximation methods in SLAM. The proba¬ 
bilistic model in §5.2| was defined in Stan as well as SLAM, and used to estimate parameters 
for a classic data set Murray, 2002 . The purpose is to verify that the estimates comes out 


similar, but that SLAM can make the calcnlations more quickly because of the approxima¬ 
tion. The posterior median given by Stan is not identical to the maximum marginalized 
likelihood estimate given by SLAM. However, for small variances, we expect them to be 
close. 

The results are shown in the table below along with the corresponding results from 
SLAM. The test data set comes from an influenza infection at a British boarding school (M ur[ - 
|ray[ |200^ . Initial guesses for /? and 7 are set equal to Murray’s deterministic estimates. 
The initial guess for a (the measurement error parameter) is arbitrarily set to 0.1. 

Figure shows the estimates of the course of infection given by SLAM, Stan, and the 
deterministic ODEs, given their estimated parameter valnes. The deterministic model 
suggests that the epidemic could not end as quickly as it did. Stan and SLAM can follow 
the data more closely using the stochastic model from §5.2[ The Stan and SLAM paths are 
very similar, but we don’t expect them to be identical becanse Stan is showing posterior 
means while SLAM is showing a overall maximum likelihood estimate. 



Figure 4. Comparison of Stan and SLAM predictions. The deterministic 
prediction follows the exact ODEs using the best fit values of f3 and 7 
provided by Murray. Stan shows a posterior mean, while SLAM uses a 
maximum likelihood path. 
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/3 

7 

a 

Time 

Deterministic (Murray) 

2.18 X 10“^ 

0.440 

- 


Stan GI Low 

2.13 X 10"^ 

0.462 

0.092 


Stan posterior median 

2.48 X 10"^ 

0.518 

0.194 

61.3 s 

SLAM basic 

2.47 X 10"3 

0.519 

0.175 

1.7 s 

SLAM higher order 

2.47 X 10"3 

0.519 

0.176 

4.5 s 

Stan GI High 

2.94 X 10"^ 

0.601 

0.400 



Table 1. SIR model results. 


Table [T] shows the results of the test. As expected, the parameter estimates from Stan 
and SLAM are very similar. For /3 and 7 , the differences are negligible. For a the Stan 
and SLAM estimates are within a fraction of the Cl, or abont 10% of the value. But for 
this problem, compared with Stan, basic SLAM was about 35 times faster than Stan, and 
higher-order SLAM was more than 13 times faster. 

For this test, Rstan 2.8.0 was run in R 3.2.2, using 4 chains of length 2000 each, with 
options set to maximize speed (multicore, optimized compilation). Different versions of 
Stan code for the model were tested. Snrprisingly, the fastest version used unvectorized 
code. 


5.4. Comparison with INLA on Tokyo rain data. INLA Rue et al. 2009, Rue et ah] 


is an R package which also uses Laplace approximations to compute integrals and estimate 
parameters, but for a different class of problems (Gaussian Markov Random Fields [Rue 


and Held, 2005] ). GMRFs on a line can be analyzed by both INLA and SLAM. INLA also 


uses sparsity, but the higher-order terms are calculated in a simpler (and more general) 
way which takes quadratic time and memory in the number of nodes. Therefore SLAM, 
with a linear-time algorithm for the higher-order terms, should be faster for sufficiently 
large GMRFs on a line. These experiments show that this is true for repeats of the Tokyo 
data set Martino and Rue, 2009 from the INLA package. 


The Tokyo data set looks at Tokyo rainfall from the start of 1983 to the end of 1984. 
Each row of Tokyo gives a calendar day (e.g. Jannary 28) and the number of times it rained 
on that day in either 1983 or 1984. For obvious reasons this count is always 2 except for 
February 29 (from 1984). The top few rows are presented in Table 


i 

m 

Vi 

1 

2 

0 

2 

2 

0 

3 

2 

1 

4 

2 

1 

5 

2 

0 


Table 2 . Initial rows of Tokyo rain data. 
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Here i is the calendar date (e.g. 1 is January 1), n* is the number of times that date 
occurred, and yi is the number of rain days for that date. 

In the INLA manual [Martino and Rue 2009|, the Tokyo data set is analyzed using 


INLA with a second-order random walk model ( Rue and Held 2005 , section 3.4.1). For 


each date z, the chance of rain is assumed to be some pi, so the chance of y* rain days for 
that date is —pi)'^'~^'■ The second-order random walk is over x* = log{pil{l —pi), 

so i for given A is 


N ^ N-2 ^ 

^(x|A) = ( - yiXi -I- rii log{l + + 2 X] (^-^og X + -^ixi +2 - 2xi+i -h Xif'^ 

i=l ^ i=l * 

After slight modification^ we compared the behavior and performance of INLA and 
SLAM. 

In order to adjust the size of the data set, we simply repeated the data set k times, e.g., 
for k=2 there are 732 data points instead of 366 and yi +366 = Vi- 

By default, INLA uses an approximation to the higher order terms which is easier to 
compute (strategy=‘simplified.laplace’). To get the full higher order Laplace terms, as used 
by SLAM, we set INLA strategy = ‘laplace’. 

After running both packages on Tokyo, the estimates are not identical but are very 
similar, especially for the larger data sets (see figures). The most important results here 
have to do with run time and especially memory use. For k up to 3, INLA is faster. At 
k=4 {N ~ 1.5 X 10^), SLAM becomes faster. The relative difference grows as shown in 
figure 

The SSE for the simplified Laplace is several times larger than for INLA full or for 
SLAM. This is almost entirely due to greater error near the ends of the year (1 and 366). 

The expected number of rain days over the year is riiPi and the observed number of 
rain days is YiVi- SLAM these are equal to at least four digits. This probably means 
that, under some conditions, the method forces them to be equal. For now we make no 
suggestion why. 

Lastly, in this example we had to introduce a variable x' which is supposed to be the 
time derivative of the linear predictor x. We forced x' to be close to the true derivative by 
adding a large penalty for any difference between x' and Ajx/Ajt. These results show that 
it is (sometimes) possible to use SLAM for a constrained system using this simple trick. 

^Three modifications were made in order to work around features that are not yet present in the SLAM 
code. 

1. SLAM doesn’t take second derivatives, so an extra variable x'i is added which is forced to be nearly 

equal to the true Then the usual RW2 penalty is given in terms of 

2. INLA has an option “cyclic” which forces the function to be periodic. SLAM does not currently have 
such an option. Therefore cyclic is set to FALSE for INLA. Happily, the predicted functions are nearly 
periodic. 

3. The Tokyo data set covers 1983-84, so rii = 2 for all dates except February 29. For Feb. 29 = 1 

because only 1984 was a leap year. For now, it was necessary to set rii = 2 for February 29 for both 
programs. Think of it as a recording error. 
















20 


JOHN TILLINGHAST 


10.4 
10.2 

10 

9.8 

9.6 

9.4 
. ^-2 

9 

8.8 

8.6 

8.4 
8.2 

8 

7 . 8 , 


1 —'—I—'—I—^—r 


/■ 

- / 

/ / 


— INLA cyclic mean 

— INLA mean 

INLA confidence interval 

— SLAM estimate 


2000 4000 


_L 


J_ 


6000 8000 10000 12000 
Data points in set 


_L 


14000 16000 


Figure 5. Comparison of INLA estimates, the means and ranges. 



Figure 6 . Comparison of INLA and SLAM run times. 

6 . Discussion 

We have shown how to efficiently compute a higher-order Laplace approximation for 
functional integrals on time. This can be used to define a marginalized pseudolikelihood for 
differential equation systems involving randomness. In turn, these approximate marginal¬ 
ized likelihoods can be maximized in order to estimate parameters of the system, such as 
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Figure 7. Linear predictors. 

infectiousness of a disease, or a good roughness parameter for a smoothing problem. This 
approximation is implemented in the MATLAB package SLAM. SLAM is compared on 
real data against two leading packages which also estimate/predict parameters according 
to a fully Bayesian system or mixed model. As predicted, the time needed for SLAM 
seems to grow linearly. Compared with either package, for medium-sized data sets using 
higher-order terms, SLAM produces equivalent results in far less time. This is especially 
interesting given that both Stan and INLA use compiled, optimized code. Finally, for some 
problems, if higher-order terms are used then SLAM is able to process significantly larger 
data sets than INLA. 
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8 . Appendix: Calculation of Term Ills in linear time 

8.1. Tensor levels. This shows how to compute term HIb in 0{np^) time. This is possible 
because of the raising and lowering principle described in section A similar concept 
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applies to block tensors. A tensor of order 3, X, is block off-diagonal with level (/i, I 2 , 13 ) if 
every nonzero block has ii—li = h—h = h — h- In other words, X is level {li, I 2 , 13 ) 

if it is block off-diagonal and includes the block {li, I 2 , 13 ). Levels are an equivalence class: 
level {li, I 2 , 13 ) is the same as level (/i -|- A/, I 2 + A.I, I 3 + Al) for any integer Al. 

The levels of a third-order block tensor are shifted when powers of A or are applied 
to the modes. Our T is a sum of a handful of block-diagonal and block off-diagonal levels. 
To compute Illb, we find the combinations of powers of A and A^ which connect the levels 
of the first T with the levels of the second T. The grouping is more complicated than for 
computing the near-diagonals of but is similar in spirit. 

Suppose have a single-level matrix B and a single-level tensor Tmono with level (/i, h, h)- 
If B is level Ib, then the new tensor (B (g) I (g) I) (Tmono) is also block off-diagonal, but with 
level {Ii+Ib, h, h)- The next observation is critical: if B is applied to every mode of Tmono, 
the result (B (g) B (g) B) (Tmono) is the same exact level as Tmono- 

8.2. A simplifying coordinate transformation. As we discussed in section we can 
write term Illb more simply by making a coordinate transformation: 



\ 

1 

/ 

T —— T = 

t -H-i- 


/ 

1 

\ 


where 


t = (D"'^ (g) (g) D-'^)(T). 

D is block-diagonal with block size p, so the coordinate transformation takes 0(np‘^) 
time. 

To continue, we will use the series 


= S^A'^ + S + ^A' 

/ i >0 ^>0 


Tuc 


We simplify this by making a second block-diagonal coordinate transformation to eliminate 
S. Note that S is block-diagonal and symmetric because each of its terms A^'^A'^ is block- 
diagonal and symmetric. S is also positive definite, since I is positive definite and every 
A^'^A'i is semidefinite. So if we define C to be the (upper!) Cholesky decomposition 
S = C^C. Then = C, SC~^ = and = I. Now we can make the 

coordinate transformation 


y^CV, H = CThC 


from which 
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H -1 = 


= C-^ (^S ^ A/" + S + ^ 

/i >0 i^>0 

= C-^ (s Y, A'") + I + C-^ ( Y A^'^s) C'^ 

/1>0 !^>0 

The first and third terms are transposes of each other. Let’s look at the first term alone. 


C-^(SY^^)C~^ = C(^^A'"jC-i 

/ i >0 / i >0 

= Y (cAf^C-^ 

/ i >0 

= 

/ i >0 

with A = C”^AC. At last, 

n ^ = Y^'" + '^ + Y 

fM>0 ^>0 

= ^ A'" + ^ A^'' - I 

fi>0 z ^>0 

= A + A^-l 

where A = A. 

8.3. Statement of Results. 

71/6 = 2(T,5(t)) 

+ 6(t, (l ® (I + A + A^) ® (A + A^)) (5(t))) 
+ 6(t, ((I + A) ® A ® A^) (t)) 

-(t,t) 

where (P, Q) = Pa^QaiS'^ 
and 5(T) is defined as 

S{T) = ^ (^A'l ® A'l ® A^i) (t). 

q>0 

5(T) can be computed in 0{np^) time, and so can the whole sum. 
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8.4. How to expand the sum and group the terms. Let’s start by defining a more 
compact notation. 



This has the properties of multilinearity and symmetry under permutation; 



Furthermore, if all the matrices are transposed, then you get the same value: 



Multilinearity and permutation symmetry mean that we can expand our bracket of sums 
in the same way as with a product of sums. 
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^ + - II 

^ - I > 

A + A^ -I) 





Combining transposes lets us group the terms together further: 



The calculation of 



is trivial: it’s just (T, T). The other two involve grouping the tiny 


minority of nonzero subterms in the first two terms. The principle to remember is that 
none of the nonzero levels are more than two apart in any of the 3 modes. This means 
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1 


that < 

a4 

1 

[a^J 

g + s or r + 


^Ts 


= 0 if either 


A 

A-1 

A^-1 


'I + A + + 

A + A^ + ... 
A + A'^2 + .. 

'I + A' 

A 

AT 


A 

A-1 

AT-I 


E.>o 

X,>„A" 


^s>0 ■ 

f 

EEE A' 

q>0 r>0 s>0 I, AT- 


2—s 2—5 


r A^ 

EEE A' 

s>0 q=0 r=l [_AT® 


All terms have power r, s > 1. But this implies that the only nonzero terms have g,r, s < 1. 
If g were greater than 1, we would have g + s>l + s>2. r<l for similar reasons. And 
if s were greater than 1 we would have s + r>l + r>2. Consequently, this triply-infinite 
sum reduces to 



111 

EEE 


q=0 r=l 5=1 




Finally, we reach the first sum, which is also the most complicated: 



^ ^ ^ 
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rA'i 

EEE A- 

q>0r>0s>0 I A® 


E < 


>+3E < 

A^ 

>+3E < 

A^ 

•+«iE I 

q=r=s 

[a®J 

q=r<s 

[a®J 

q<_r=s 

[a®J 

q<r<s \ 


A^' 

A s 


= E< 

fA^l 

A^ 

•+3E( 

q>0 

[A^j 

q>0 ' 

= E< 

A^ 

>+3Ej 

q>0 

[A^j 

g>o 1 

= E< 

fA^l 

A^ 

•+3Ej 

q>0 

[A^j 

q>o \ 

= E< 

A^ 

>+3Ej 

q>0 

[a^} 

q>o \ 


f AM f A<^] 


A^ ] 


^^q+2j / 

g>0 V [a'i+iJ 


A^ 

^q+2 

^q+2 


I. 


A^ 

+ 6 ^ ^ A^i+i 

g>0 I A‘1+2 


A^ 

A^ 


I 


A^ 


+ 3 ^ < A'l+i + A^+^ 


q>0 


A^ 


Using 


AR+1 + A^+^ 
A^ 

[ + A + A^)A^ 
(A + A^)A^ 


S{T) = Y,[A^^ A'lj (t) 

q>0 


then these terms become 


fA^i' 

E r 

q>0 [A^ 

A^ 

^ (I + A + 

<?>0 I (A + A2)A'i 


(T,5(T)) 

(t, fl ® (I + A + A^) ® (A + A2) ) (5(T))) 


Putting the whole expression together, 
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(A + A^ -1' 

Illb = {A + A^ -1 
[A + A^-l 

(A] f 1 [I 

= 2^^i+6^ ^-I >-<I 

ll 

= 2f (t, 5(t)) + 3(t, fl 0 (I + A + A2) ® (A + 


I + A' 

+ 6 A 

at 


Like the computation in near-diags there is a simple, linear recurrence relation for 5(T). 
First, 5(T) solves a linear equation: 


5(t) = t + ^ (^A^i 0 A'l ® A^j (t) 

q>0 

= T + (^A® A® A)(5(t)) 

In block notation, this gives us a recurrence relation similar to ??. This is a forward 
recurrence instead of backwards because of 


— TjjTj + ^ ^ ® ^jj' ® (‘5(T)) 

i'j'k' 

= Tijk if i=l or j=l or k=l 

— ^ijk “b ^A 2 ^ 2—1 ^ ^ ) otherwise 

The recurrence is forward instead of backward because we chose to use J2q>o (^A'l® A'l® 

A'^^(T) rather than X]ij>o ® ® A'^‘^^(T) for the main sum. Each successive 

block of 5(T) is computed in terms of a known block of T and (if present) the previous 
block of 5(T) from the same level. There are 7 block levels, the number of blocks per level 
is 0(n), and the number of operations in a matrix-times-block operation is 0{p^), so the 
complexity of finding 5(T) is O(np^). 

The computation can be accelerated by separating the levels of T and making greater 
use of symmetry, but it will still be 0{np‘^). 
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